Abstract—To address the exponential growth of the solution space in
multiple hypothesis tracking (MHT), a continuous consistency model (CCM)
is proposed. The key to improving MHT performance is to improve the
efficiency of branch management. However, because detector failures are
inevitable, a large number of virtual (dummy) nodes are used when the
trees are expanded and each detection is organized as the root node of a
new tree. This leads to rapid growth of branches. Unlike previous MHT
implementations, CCM divides detections into four categories: continuous,
left continuous, right continuous and discontinuous. Comparative
experiments show that CCM significantly improves computational efficiency
and achieves state-of-the-art results on the MOT Challenge benchmark.


Index Terms—Visual Tracking, Multiple Hypothesis Tracking, Data Association


1 INTRODUCTION 


Tracking multiple targets has been an important topic in the 
field of computer vision. Although significant progress has 
been made, there are still many tough unsolved problems. 
Similar appearances, frequent occlusions and motion blur 
(camera shakes), for instance, are some common obstacles 
for tracking. 

Multiple hypothesis tracking (MHT) [1] is one of the 
most successful frameworks to solve these problems. It 
keeps trees of hypotheses for targets and evaluates the 
likelihood of each branch to select the most likely one. 
However, exponential growth of the solution space is the
critical defect of MHT. The scale and computational complexity
grow dramatically with the number of frames and detections.
Strict but compromising constraints are used to control this
growth, such as limits on the number of hypothesis trees, the
number of leaf nodes and the maximal number of dummy nodes.
In addition, an iterative updating technique [2] has been applied
to speed up tracking, but it does not address the fundamental
issue of the growth.

Fig. 1. (a) and (b) show the tree generation process of MHT without and
with CCM, respectively. Each circle represents a detection node in the
tracking trees. Dummy nodes are shown as red dashed circles and represent
potential missing detections.

A main reason for the rapid growth is the strategy that
every leaf node is extended with a dummy node and every
detection is built as the root node of a new tree, as shown
in Figure 1(a). In this paper, we propose a novel continuous
consistency model (CCM) to characterize the relationship
between targets in adjacent frames. CCM categorizes detections
into four typical types: continuous, left continuous, right
continuous and discontinuous. As shown in Figure 1(b), it
reduces the scale of tracking trees by controlling the hypothesis
generation process. In addition, we remove the impractical
constraints on dummy nodes by exploiting the CCM. As a result,
our method significantly improves computational efficiency
while achieving better tracking performance. Comparative
experiments show that our method is effective in restraining the
exponential growth of MHT and reduces the risk of identity
switches caused by long-term occlusions. The contributions of
this paper are:


• A novel continuous consistency model (CCM) to
  classify detections into four typical categories and
  describe the correlation consistency among detections;

• A method to restrain the exponential growth of MHT
  by exploiting CCM, thus significantly reducing
  the computational time;

• Removal of the constraints on the number of dummy
  nodes and a reduction of ID switch errors.


The rest of the paper is organized as follows. Related
work is discussed in Sec. 2. Our continuous consistency
model is described in Sec. 3. Optimization for multiple
hypothesis tracking with CCM is presented in Sec. 4.
Experimental results are shown in Sec. 5, followed by the
conclusion in Sec. 6.


2 RELATED WORK 


Tracking multiple targets has become one of the most
popular topics in computer vision. It locates targets in
every frame and recovers their trajectories through the
whole video. There are generally two types of tracking
approaches: online tracking and offline tracking.
Online trackers [3]-[5] use past and current observations 
to generate trajectories. They provide strong real-time per- 
formance to fit requirements in real applications. However, 
early errors cannot be revised in these trackers. On the other 
hand, offline trackers [1], [6]-[8] consider all observations 
in a batch of frames or even the entire video. In this section, 
we will review some typical related work. 

Tracking-by-detection is an acknowledged framework 
for multi-target tracking. It regards tracking as a data asso- 
ciation problem. Most of the recent successful trackers are 
developed based on this method. By obtaining separate ob- 
servations from the detector, the main task of tracking turns 
to building associations and constraints between detections. 
Towards this end, various approaches are proposed. 

Zhang et al. [9] proposed a network-flow-based optimization
method for tracking. It mapped the data association problem
onto a cost-flow network with a non-overlapping constraint on
trajectories and found the optimal solution with a min-cost
flow algorithm. Later, Pirsiavash et al. [10] analyzed the
number of tracks as well as their birth and death states and
gave a global solution with a greedy algorithm.
Butt et al. [11] incorporated higher-order track smoothness 
constraints into tracking. Unlike previous methods, a node 
in their network represents a candidate pair of matching 
observations between consecutive frames. However, such 
a formulation cannot be solved by min-cost network flow 
algorithm. As a result, they proposed an iterative solution 
using Lagrangian relaxation. Chari et al. [12] added pair- 
wise costs to the min-cost network flow framework and 
designed a convex relaxation solution to solve this NP- 
hard problem. Dehghan et al. [13] presented a new Target 
Identity-aware Network Flow (TINF) where the detection 
and data-association are performed simultaneously. They 
used structured learning method to learn a model for each 
target and to infer the best locations of all targets. To better 
cope with long term occlusions, McLaughlin et al. [14] 
added special edges to the tracking graph based on a motion 
model. These edges linked distant tracklets based on motion 
similarity. Later, Wang et al. [15] made it possible to track 
occluded objects by using the presence of other objects 
that contain them. However, occlusion is not the only cause of
detector failure; illumination or gesture changes can also
cause detection failures.

In addition to these network-flow-based trackers, Milan
et al. have published a series of works on tracking. In [16],
they formulated multi-target tracking as the minimization of a
continuous energy function and constructed an optimization
scheme to find strong local minima of the energy. Later in [17],
a discrete-continuous optimization problem was proposed to
handle each aspect in its natural domain, such as target
dynamics, mutual exclusion and track persistence. Data
association was performed using discrete optimization, while
trajectory estimation was posed as a continuous fitting problem.
In subsequent work, they also began to consider occlusion in the
tracking problem. As a result, a conditional random field (CRF)
was built in [18]. It considered both the conflict between
observations and the exclusion between trajectories. An
expansion-move-based MAP estimation scheme was proposed to
solve the CRF problem. To reduce the impact of occlusion, they
took superpixel information into account [19]. Every superpixel
was assigned to a specific target or classified as background.
However, the accuracy of superpixel segmentation and
classification drops dramatically when the scene becomes
complicated.

More recently, the multi-cut based trackers have shown 
impressive results. Tang et al. [7], [20], [21] linked and
clustered plausible detections jointly across space and time
and thus formulated multi-target tracking as a minimum
cost subgraph multi-cut problem. Although they employed
a feasible optimization algorithm to solve the problem, it
still suffered from low efficiency.


Multiple Hypothesis Tracking (MHT) is another traditional
method for tracking. MHT was first developed for
sensor systems by Reid [22]. It measures the probability
of each branch and eliminates unlikely hypotheses
while the trees are growing. Cox et al. [23] proposed an
efficient implementation and suggested that it is possible to 
use MHT for visual tracking. Papageorgiou et al. [24] de- 
scribed the data association problem as a maximum weight 
independent set problem (MWISP) and further developed 
MHT for tracking. Later, Kim et al. [1] demonstrated that 
MHT can show competitive results by exploiting appear- 
ance models. In addition, Manafifard et al. [25] relied on 
particle swarm optimization (PSO) to account for nonlinear 
movements and occlusions in addition to appearance. The
key to building a practical MHT-based tracker is to control
the exponential growth of branches and to improve the
efficiency and accuracy of pruning. We have previously done
some work to improve the performance of MHT. In [26], we
estimated the correlations between detections and improved
the performance in distinguishing adjacent hypotheses. In
[27], we built a fused association graph that uses both
detections and superpixels to enhance the robustness of MHT
when the detector fails. In addition, we proposed a
tracking-by-tracklet framework to improve the efficiency of
MHT in [2]. However, it requires careful tuning of the
tracklet length and the tracking window size to balance
speed and accuracy.


Deep learning methods [28]-[31] have shown impressive
results on single-target tracking in both accuracy and
efficiency. For multi-target tracking, however, these trackers
cannot effectively deal with the identity switch problem
caused by heavy and long-term occlusions. There are other
strategies for using deep learning. One commonly used approach
is to describe the appearance of targets with features from
Re-ID tasks [21], [32]. Compared with traditional feature
descriptors like SIFT or HOG, these deep-learning-based
features are more robust when calculating the similarity
between targets. Another idea is to build an end-to-end
tracking network; some attempts are [33], [34]. They take the
video as input and output trajectories directly. The most
severe shortcoming of these end-to-end trackers is the
overfitting problem. The association relationship between
targets is usually complicated and variable, so it is hard to
collect suitable and sufficient data for training.


3 CONTINUOUS CONSISTENCY MODEL 
3.1 Preliminary 


Similar to most tracking-by-detection methods, our
proposed tracker uses detections (also called observations)
as input. Detections of the same target are associated into a
complete trajectory, and trajectories of different targets
are identified by different labels. A detection can be
described as a 4-dimensional vector $d = (x, y, w, h)$, where
$x$ and $y$ are the horizontal and vertical coordinates of the
foot point (midpoint of the bottom edge), and $w$ and $h$ are
the width and the height of the detection. The trajectory of
each target can be described as a set of detections from
different frames, $T_i = \{d_1, d_2, \ldots, d_t\}$, where
$d_t$ can be a detected observation from the detector or
an estimated observation from the tracker in frame $t$.


3.2 Consistency Analysis 


In a real-world scene, the trajectory of each individual
target should be complete and continuous. In addition,
pedestrians will usually neither appear out of nowhere nor
disappear suddenly in the scene (leaving aside extreme
situations such as excessive concentration of obstacles, large
changes in camera angle of view, low image resolution, etc.).
This means that there is correlation consistency among the
detections in adjacent frames due to their continuous
trajectories. In this paper, we propose a continuous
consistency model (CCM) that describes this correlation by
classifying detections.

For a given distance threshold, each detection is either
associated with other detections in adjacent frames, or no
detection meets the threshold. We consider four typical
situations to introduce the CCM in detail, as shown in Figure 2.

Figure 2(a) shows the ideal situation in which a detection $d$
has both predecessors and successors. It means that $d$ is a
potential extension for the previous frame and has candidate
detections in the next frame. Figure 2(b) shows the situation
in which a detection only has successors. It happens when the
target does not appear in the scene (out of the scene or
occluded) in frame $t-1$ or its detections are missed until
frame $t$. Figure 2(c) shows the opposite situation to (b):
the target disappears in frame $t+1$ or the detector fails.
Finally, Figure 2(d) shows the situation in which a detection
has neither predecessor nor successor. In this case, $d$ could
be a false detection, or the detections in both adjacent frames
are missed by the detector. Thus, CCM can be described in the
following form.


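\begin{equation}
\mathrm{CCM}(d_t) =
\begin{cases}
D^{*},     & |d_t, d_{t-1}| > 0,\ |d_t, d_{t+1}| > 0, \\
D^{-},     & |d_t, d_{t-1}| > 0,\ |d_t, d_{t+1}| = 0, \\
D^{+},     & |d_t, d_{t-1}| = 0,\ |d_t, d_{t+1}| > 0, \\
D^{\circ}, & |d_t, d_{t-1}| = 0,\ |d_t, d_{t+1}| = 0.
\end{cases}
\tag{1}
\end{equation}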


In Eq. 1, the operator $|d_i, d_j|$ denotes the number of connections
between $d_i$ and $d_j$ for a given distance threshold. In this
way, we use CCM to classify detections into four categories.
Detections that have predecessors and at least one successor
are classified into $D^{*}$ (continuous detections); detections
that only have predecessors are labeled as $D^{-}$ (left
continuous detections); detections that only have successors
are labeled as $D^{+}$ (right continuous detections); detections
that have neither predecessor nor successor are regarded as
$D^{\circ}$ (discontinuous detections).

Fig. 2. (a), (b), (c) and (d) show four typical situations for detection $d$
(in blue) in frame $t$. Green boxes represent detections in adjacent frames
($t-1$ and $t+1$). Detections missed by the detector are represented by
dashed boxes. Detections are connected if the Euclidean distance of
their foot points is less than the given threshold.
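The four cases of Eq. 1 can be illustrated with a minimal sketch. It assumes detections are plain `(x, y, w, h)` tuples and that a "connection" is simply a foot-point distance below the threshold, as in the Fig. 2 caption; the names `Category`, `count_connections` and `classify` are hypothetical and not taken from the paper.

```python
import math
from enum import Enum


class Category(Enum):
    CONTINUOUS = "D*"        # predecessor and successor
    LEFT_CONTINUOUS = "D-"   # predecessor only
    RIGHT_CONTINUOUS = "D+"  # successor only
    DISCONTINUOUS = "Do"     # neither


def foot_distance(a, b):
    """Euclidean distance between the foot points (x, y) of two detections."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def count_connections(d, others, threshold):
    """Number of detections in `others` closer to d than `threshold`."""
    return sum(1 for o in others if foot_distance(d, o) < threshold)


def classify(d, prev_frame, next_frame, threshold):
    """Apply the four cases of Eq. 1 to detection d in frame t."""
    has_pred = count_connections(d, prev_frame, threshold) > 0
    has_succ = count_connections(d, next_frame, threshold) > 0
    if has_pred and has_succ:
        return Category.CONTINUOUS
    if has_pred:
        return Category.LEFT_CONTINUOUS
    if has_succ:
        return Category.RIGHT_CONTINUOUS
    return Category.DISCONTINUOUS
```

For example, a detection with a neighbour 6 pixels away in frame $t-1$ and nothing nearby in frame $t+1$ is labeled left continuous under a threshold of 12.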


3.3 Maximum Weight Bipartite Graph Matching 


To divide detections into four categories by CCM, we need
to build the connections between detections in adjacent
frames. For a given frame $t$, we find a maximum-weight
matching on the weighted bipartite graph between frames $t-1$
and $t$, and another between frames $t$ and $t+1$. Therefore,
CCM can be regarded as a series of maximum-weight bipartite
matching problems through the frames.

For a given edge $e = \{d_1, d_2\}$, we learn the weight $w_e$
from their distance and appearance features (Eq. 2),
where $f_1$ represents the Euclidean distance of the detections
and $f_2$ is the cosine distance of their appearance features.
We construct a 256-dimensional vector for each detection to
describe its appearance.

Finding a maximum-weight bipartite matching is a classical
problem that can be solved efficiently by the Hungarian
algorithm. We describe our matching algorithm as Alg. 1; its
time complexity is $O(n^3)$. According to the definition of CCM
in Eq. 1, nodes that are matched in both the previous and the
next frame are classified as $D^{*}$; nodes matched only in the
previous frame are $D^{-}$; nodes matched only in the next
frame are $D^{+}$; isolated nodes are $D^{\circ}$. As shown in
Figure 3, detections are divided into four categories according
to the matching results.

Fig. 3. Illustration of maximum weight bipartite graph matching. Four
kinds of detections in frame $t+1$ and $t+2$ are shown in blue, yellow,
green and red circles respectively.


Algorithm 1 Matching Algorithm

Input: weights of edges between frame $t-1$ and frame $t$
Output: Optimal matching

Step 1. Define matrix $M_{n_1 \times n_2}$, where $n_1$ is the number
of targets in frame $t-1$ and $n_2$ is the number of targets
in frame $t$. The element $e_{i,j}$ in $M$ is defined as $1 - w_{i,j}$,
where $w_{i,j}$ is the weight of edge $\{d_i, d_j\}$.

Step 2. Add dummy rows or columns with zeros to make
$M$ square.

Step 3. Subtract from every row its smallest element, so that
there is at least one zero in each row.

Step 4. Subtract from every column its smallest element, so
that there is at least one zero in each column.

Step 5. Cover all zeros with a minimum number of lines.
If the number of lines equals the size of $M$, an optimal
matching is found; go to Step 7.

Step 6. Find the smallest element that is not covered by the
lines in Step 5. Subtract it from all uncovered elements,
and add it to all elements at the intersections of the lines.
Then, go to Step 5.

Step 7. Choose zeros in different rows and columns; the
corresponding elements in the original matrix $M$ give the
optimal assignment.


TABLE 1

Attribute              | Description
$x_t, y_t, w_t, h_t$   | Detection's information of the node in frame $t$
$c_t$                  | Confidence of the detection in frame $t$
$a_t$                  | Appearance of the detection in frame $t$
$c_m$                  | Highest confidence of the detection in the hypothesis before frame $t$
$a_m$                  | Appearance of the detection with the highest confidence before frame $t$
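As an illustration of the assignment step, the following sketch builds the $1 - w_{i,j}$ cost matrix, pads it to a square with zeros (Steps 1 and 2 of Alg. 1) and solves it with the shortest-augmenting-path (potentials) form of the Hungarian algorithm, which is a different but equivalent formulation to the line-covering steps above. The function names and the `min_weight` gate (a weight of 0 is assumed to mean "no edge") are assumptions for illustration, not part of the paper.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square matrix, O(n^3).

    Returns col_of_row, where col_of_row[i] is the column assigned to row i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials
    v = [0.0] * (n + 1)    # column potentials
    p = [0] * (n + 1)      # p[j]: 1-based row matched to column j (0 = free)
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:     # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [0] * n
    for j in range(1, n + 1):
        col_of_row[p[j] - 1] = j - 1
    return col_of_row


def match_frames(weights, min_weight=0.0):
    """Match detections of frame t-1 (rows) to detections of frame t (columns)."""
    n1 = len(weights)
    n2 = len(weights[0]) if n1 else 0
    size = max(n1, n2)
    # Steps 1-2: cost = 1 - w, padded with zero rows/columns to a square matrix.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(n1):
        for j in range(n2):
            cost[i][j] = 1.0 - weights[i][j]
    col_of_row = hungarian(cost) if size else []
    # Keep only pairs of real detections that are joined by an actual edge.
    return [(i, j) for i, j in enumerate(col_of_row)
            if i < n1 and j < n2 and weights[i][j] > min_weight]
```

With `weights = [[0.9, 0.8], [0.8, 0.1]]`, a greedy choice of the 0.9 edge would force the 0.1 edge, whereas the optimal matching pairs row 0 with column 1 and row 1 with column 0. A detection that appears in no returned pair for either adjacent frame would then be treated as isolated.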


4 OPTIMIZATION FOR MULTIPLE HYPOTHESIS TRACKING
4.1 MHT Overview 


MHT is a breadth-first search algorithm. It solves the tracking
problem by generating multiple trees, evaluating each
branch with a similarity score, and then selecting the most
promising trajectories. A node in a track proposal is either a
detection from the detector or an estimated dummy node.

There are four main processes in MHT: constructing,
updating, scoring and conflict elimination. First, new trees
are constructed in each frame. The root node of a new
tree is a detection in the frame, representing a new track
proposal. Second, existing trees are expanded with the
newly arriving detections in the frame. Meanwhile, trees are
also extended with dummy nodes. Then, every leaf node is
scored to evaluate its branch. However, as every detection
should represent only one target, nodes of the same
detection in different trees may conflict. Finally, to address
this problem, a Maximum Weighted Independent Set (MWIS)
algorithm is used to find the best set of proposals [24]. The
score of each proposal serves as its weight in the MWIS
problem, and only the selected proposals are kept and updated
in the next frame.
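The conflict-elimination step can be pictured with a toy sketch. It assumes each proposal is a dictionary holding a score and the set of detection identifiers it uses, and it finds the best conflict-free subset by exhaustive search, which is a stand-in chosen for clarity and not the MWIS solver of [24]. All names and values below are hypothetical.

```python
from itertools import combinations


def conflicts(a, b):
    """Two track proposals conflict if they share at least one detection."""
    return not a["detections"].isdisjoint(b["detections"])


def best_independent_set(proposals):
    """Exhaustively find the conflict-free subset with the largest total score.

    Exponential in the number of proposals; only meant for tiny examples.
    """
    best, best_score = [], 0.0
    for r in range(1, len(proposals) + 1):
        for subset in combinations(range(len(proposals)), r):
            if any(conflicts(proposals[i], proposals[j])
                   for i, j in combinations(subset, 2)):
                continue
            score = sum(proposals[i]["score"] for i in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best, best_score


# Three proposals over two frames; ids such as "f1_d1" are made up.
proposals = [
    {"score": 3.0, "detections": {"f1_d1", "f2_d1"}},
    {"score": 2.0, "detections": {"f1_d1", "f2_d2"}},
    {"score": 2.5, "detections": {"f1_d2", "f2_d2"}},
]
selected, total = best_independent_set(proposals)
```

Here the first and third proposals share no detection, so they are selected together, while the second proposal conflicts with both and is dropped.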

In this paper, we mainly focus on the first two processes,
construction and updating, in which the scale of the trees grows
exponentially. Figure 4 shows our framework for MHT and how CCM
helps control the scale of the tracking trees. The attributes
of the nodes used in this paper are summarized in Table 1.


4.2 CCM for Tree Construction and Updating 


The goal of MHT is to keep as many potential hypotheses
as possible and make decisions later. However, due to the
growing scale, traditional MHT-based trackers adopt
sophisticated pruning heuristics to keep the scale manageable.
As a result, many hypotheses are pruned improperly.

Construction and scoring are two important processes
that affect the scale of track trees. In addition to pruning
strategies and parameters, the large number of dummy nodes
also contributes significantly to the rapid expansion of trees.
We therefore propose an approach that uses CCM to control the
number of track trees and dummy nodes and thus reduce the
scale of MHT.

Fig. 4. Framework of MHT with continuous consistency model. Detections are
categorized into four types before constructing and updating track trees.

Fig. 5. Constructing and updating process of the track trees with CCM. The
legend distinguishes $D^{*}$ (continuous node), $D^{-}$ (left continuous
node), $D^{+}$ (right continuous node), $D^{\circ}$ (discontinuous node) and
dummy nodes. Only $D^{+}$ and $D^{\circ}$ are used as the root nodes, while
only $D^{-}$, $D^{\circ}$ and dummy nodes are extended with dummy nodes.


Unlike traditional approaches, we do not construct a
new tree for every detection; only detections belonging to
$D^{+}$ and $D^{\circ}$ are used as the roots of trees. According
to the discussion of CCM, $D^{-}$ and $D^{*}$ detections are most
likely extensions of detections in the previous frame.
Therefore, it is reasonable not to construct new trees for them.


As shown in Figure 5, another difference from the traditional
approaches is the updating strategy. Not all leaf nodes are
extended with dummy nodes; a leaf is extended with a dummy node
only if it is a $D^{-}$ or $D^{\circ}$ detection or is itself a
dummy node. As defined by CCM, a $D^{+}$ or $D^{*}$ detection is
very likely to have a potential extension in the next frame.
Extending $D^{+}$ or $D^{*}$ detections with dummy nodes is
therefore more likely to introduce interference than to generate
plausible hypotheses. In addition to dummy nodes, each leaf
node is extended with newly arriving detections that satisfy
the distance threshold. The distance between detections is
defined as the Euclidean distance of their foot points as
follows.

\begin{equation}
d_{1,2} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{3}
\end{equation}

Fig. 6. Illustration of three kinds of nodes in frame $k$ for the calculation
of the scale of the track trees.
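The construction and updating rules of this subsection can be sketched as two predicates plus a leaf-expansion helper. The sketch assumes nodes are dictionaries with a foot point `"xy"` and a category `"kind"`; these names are hypothetical, and the dummy child simply copies its parent's position as a placeholder instead of the interpolated location used in the paper.

```python
import math

CONTINUOUS, LEFT, RIGHT, DISCONTINUOUS, DUMMY = "D*", "D-", "D+", "Do", "dummy"


def starts_new_tree(category):
    """Only right continuous and discontinuous detections become root nodes."""
    return category in (RIGHT, DISCONTINUOUS)


def gets_dummy_child(leaf_kind):
    """Only D-, Do and dummy leaves are extended with a dummy node."""
    return leaf_kind in (LEFT, DISCONTINUOUS, DUMMY)


def foot_distance(a, b):
    """Eq. 3: Euclidean distance between two foot points (x, y)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def expand_leaf(leaf, detections, threshold):
    """Return the children created for one leaf in the next frame."""
    # Real detections that satisfy the distance threshold.
    children = [d for d in detections
                if foot_distance(leaf["xy"], d["xy"]) < threshold]
    # A dummy child is added only for the leaf kinds allowed by CCM.
    if gets_dummy_child(leaf["kind"]):
        children.append({"xy": leaf["xy"], "kind": DUMMY})  # placeholder location
    return children
```

Under these rules a continuous leaf with one nearby detection produces exactly one child, whereas a traditional MHT update would always add a second, dummy child.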


Dummy nodes share the same width, height and appearance
features as their parent nodes. Their locations are
estimated by linear interpolation. We deliberately do not use
more complex methods such as non-linear regression or Bayesian
estimation, for two reasons. First, the location of a pedestrian
is determined by its own motion and that of the camera, and the
accuracy of detectors has improved greatly in recent years. In
the 5,919 frames of MOT17Det [35], there are only 7,599 false
positives for the SDP detector and 10,081 for the FRCNN
detector, compared with 42,308 for DPM, which was proposed in
2010. Sudden changes in the location and size of detections
hardly occur with advanced detectors, so more complex methods
do not provide a much more accurate prediction. Second, in
scenes taken by moving cameras, the motion of the camera is
almost unpredictable. Based on these two points, and because
location prediction is not the focus of this paper, we use a
simple prediction method to improve the efficiency of the tracker.
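A minimal sketch of linear interpolation for dummy-node locations follows. The paper does not spell out which observations anchor the interpolation, so this example assumes the foot points of the detections on both sides of a gap are known; `interpolate_gap` is a hypothetical helper name.

```python
def interpolate_gap(start, end, start_frame, end_frame):
    """Linearly interpolate foot points for the frames strictly inside a gap.

    `start` and `end` are (x, y) foot points observed at start_frame and
    end_frame. Returns {frame: (x, y)} for the frames without a detection.
    """
    span = end_frame - start_frame
    if span <= 1:
        return {}
    filled = {}
    for frame in range(start_frame + 1, end_frame):
        alpha = (frame - start_frame) / span  # fraction of the gap covered
        filled[frame] = (start[0] + alpha * (end[0] - start[0]),
                         start[1] + alpha * (end[1] - start[1]))
    return filled
```

For a target seen at (0, 0) in frame 10 and at (4, 8) in frame 14, the three dummy nodes in between are placed at equal steps along the straight line joining the two foot points.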


4.3 Scale Analysis 


We now discuss theoretically the difference in the scale of the
track trees with and without CCM. In MHT-based trackers,
$B_{th}$ is a tunable parameter that controls the maximum
number of tree branches. As it only takes effect in a few
extreme situations, we do not consider its influence in the
following analysis (regarding $B_{th}$ as $+\infty$).

First, we discuss the scale of track trees in traditional
MHT. For a given frame, there are three kinds of nodes
as shown in Figure 6. They are denoted $d_k$, $e_k$ and $r_k$
respectively in Eq. 4. The total number of nodes in frame
$k$ is represented as $n_k$. $N_k$ is the number of all nodes
from frame 1 to $k$. They are calculated as follows.

\begin{equation}
\begin{aligned}
n_k &= d_k + e_k + r_k \\
    &= n_{k-1} + e_k + r_k \\
    &= n_1 + \sum_{i=2}^{k} e_i + \sum_{i=2}^{k} r_i
\end{aligned}
\tag{4}
\end{equation}

Then we analyze the contribution of CCM to MHT.
We use $n'_k$ and $N'_k$ to represent the number of nodes
to avoid ambiguity. When using CCM for updating and
constructing, only left continuous ($D^{-}$) and discontinuous
detections ($D^{\circ}$) are extended with dummy nodes, and only
right continuous ($D^{+}$) and discontinuous detections
($D^{\circ}$) are used as the root nodes. Compared with
traditional MHT, the number of nodes is decreased according to
the result of bipartite matching in CCM. We use $p$ to represent
the probability of finding a successor for a node ($D^{+}$ and
$D^{*}$). It is the same as the probability of finding a
predecessor for a node ($D^{-}$ and $D^{*}$), because the
matching between adjacent frames is done at the same time.
Hence, $n'_k$ and $N'_k$ can be calculated as given in Eq. 5.

Comparing $N_k$ and $N'_k$, the scale of the track trees is
decided by the matching rate of CCM, which is significantly
influenced by the detector. The quantitative comparison is
presented in the experiment section.


4.4 Scoring 


During construction and updating, we mark the node in 
each branch that has the highest confidence (provided by 
detectors). The score of the branch is recursively defined as 
follows. 

\begin{equation}
\begin{aligned}
S_0 &= 0 \\
S_t &= S_{t-1} + S_{mot} + S_{app} + S_{appM}
\end{aligned}
\tag{6}
\end{equation}

where $S_{mot}$ denotes the physical distance between detections,
calculated in the same way as $f_1$ in Eq. 2. The latter two
terms evaluate the appearance similarity locally and globally:
$S_{app}$ is the normalized cosine distance between $a_t$ and its
parent $a_{t-1}$, and $S_{appM}$ is the normalized cosine
distance from $a_t$ to $a_m$.
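One step of the recursion in Eq. 6 might look like the sketch below. The paper does not state how the cosine distance is normalized, so mapping it to $[0, 1]$ via $(1 - \cos)/2$ is an assumption, as are the function names; the motion term is passed in as a ready-made number because its exact form depends on Eq. 2.

```python
import math


def cosine_distance(a, b):
    """Cosine distance rescaled to [0, 1]; the rescaling is an assumption.

    Both vectors must be non-zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 * (1.0 - dot / norm)


def update_score(prev_score, s_mot, a_t, a_prev, a_m):
    """One step of Eq. 6: S_t = S_{t-1} + S_mot + S_app + S_appM."""
    s_app = cosine_distance(a_t, a_prev)   # local term: against the parent node
    s_app_m = cosine_distance(a_t, a_m)    # global term: against the most confident node
    return prev_score + s_mot + s_app + s_app_m
```

Starting from $S_0 = 0$, calling `update_score` once per node along a branch accumulates the branch score.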


5 EXPERIMENTS 
5.1 Datasets and Metrics 


Our proposed tracker is evaluated on both MOT Challenge
2016 [35] and 2017, a widely used benchmark for multi-target
tracking. There are 14 sequences (7 training, 7 test)
with 11,235 frames in MOT 2016, and 42 sequences (21
training, 21 test) with 33,705 frames in MOT 2017. MOT 2017
consists of the same videos as MOT 2016 but provides a
different set of detections for each video from each of three
detectors: DPM [36], FRCNN [37] and SDP [38].


All the detections used in the experiments are provided 
by MOT benchmark for a fair comparison. 


We adopt the CLEAR MOT metrics [39] for quantitative
evaluation. MOTA↑ (multiple object tracking accuracy) is a
combination of FP↓ (false positives), FN↓ (false negatives)
and IDS↓ (identity switches). IDF1↑ [40] is the ratio of
correctly identified detections over the average number of
ground truth and computed detections. MOTA and IDF1
are two important metrics to evaluate trackers. The former
is primarily concerned with whether the targets are tracked,
while the latter focuses on whether the targets are labeled
with the correct ID. When MOTA shows similar results, IDF1
is better suited to evaluating how consistently trackers follow
targets. In addition, MT↑ (mostly tracked, > 80%), ML↓
(mostly lost, < 20%), FM↓ (track fragmentations) and Hz↑
(processing speed, frames per second) are also reported. The
indicator ↑ means the higher the better and ↓ means the
lower the better.
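For reference, the two headline metrics can be written as short functions. MOTA follows the standard CLEAR MOT definition and IDF1 follows the ratio described above; the argument names (`num_gt`, `idtp`, `idfp`, `idfn`) are chosen here for illustration, and the numbers in the usage note are made up.

```python
def mota(fp, fn, ids, num_gt):
    """MOTA = 1 - (FP + FN + IDS) / GT, with GT the number of ground-truth boxes."""
    return 1.0 - (fp + fn + ids) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = correctly identified detections over the mean of GT and computed ones.

    GT detections = IDTP + IDFN, computed detections = IDTP + IDFP.
    """
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

With 10 false positives, 30 false negatives and 5 identity switches over 100 ground-truth boxes, `mota` gives 0.55; with 80 correctly identified detections and 20 identity errors of each kind, `idf1` gives 0.8.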


5.2 Implementations 


Since the comparison of elapsed time is one of the major
parts of this paper, we run all trackers under the same
hardware configuration: Intel Core i7-8700K@3.7GHz, 32GB
DDR4-2400MHz, 1TB SATA3 HDD-10000rpm. Unless otherwise
indicated, we use the parameters given in the respective papers
for MHT_DAM [1] and TLMHT [2] for comparison. Specifically, the
N-scan pruning parameter is $N = 5$, the maximum number of tree
branches is $B_{th} = 100$ and the distance threshold is
$d_{th} = 12$ for both Eq. 1 and Eq. 3. In addition, we extract
convolutional neural network features from GoogLeNet [41] as
the appearance features of the detections.


5.3 Effectiveness Analysis 


One of the main goals of this paper is to improve the
effectiveness of MHT by controlling its scale. In this section,
we compare our proposed CMT tracker with other MHT-based
trackers, including TLMHT [2], HAF [27] and MHT_DAM [1].


HAF and MHT_DAM are designed on the tracking-by-detection
framework. Branches are extended node by node according to the
detections in each frame. TLMHT is a tracking-by-tracklet
tracker, as it extends its branches by tracklets instead of
single detections. However, all these trackers extend every
branch with dummy nodes to represent missing detections, and
every detection (or tracklet) becomes the root of a new tree in
the next frame. In contrast, the CMT tracker only extends
$D^{-}$ and $D^{\circ}$ nodes with dummy nodes, and only chooses
$D^{+}$ and $D^{\circ}$ as root nodes.


TABLE 2
Number of Dummy Nodes on Different Detectors

Detector | Method            | Dummy     | Total     | Ratio
DPM      | Ours              | 415,664   | 742,257   | 56.0%
DPM      | TLMHT [2]         | 264,357   | 480,467   | 55.0%
DPM      | HAF [27]          | 2,311,775 | 4,444,972 | 52.0%
DPM      | MHT_DAM [1]       | 2,057,198 | 3,609,119 | 57.0%
FRCNN    | Ours              | 169,448   | 305,858   | 55.4%
FRCNN    | TLMHT (len=3) [2] | 134,471   | 249,112   | 54.0%
FRCNN    | HAF [27]          | 1,211,902 | 2,372,264 | 51.9%
FRCNN    | MHT_DAM [1]       | 1,040,795 | 1,763,691 | 59.0%
SDP      | Ours              | 267,722   | 520,841   | 51.4%
SDP      | TLMHT (len=3) [2] | 248,742   | 441,728   | 56.3%
SDP      | HAF [27]          | 2,033,271 | 3,597,202 | 56.5%
SDP      | MHT_DAM [1]       | 1,906,411 | 3,280,139 | 58.1%


To verify the effectiveness of CCM in controlling the
scale of the track trees, we count the number of nodes
generated during tracking with the different detections. As
shown in Table 2, dummy nodes account for about half of all
nodes (Ratio column). The CMT tracker dramatically reduces the
number of nodes: it uses only 20.6%, 17.3% and 15.9% of the
nodes of MHT_DAM with the DPM, FRCNN and SDP detections,
respectively.
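The percentages quoted above follow directly from the Total column of Table 2, as the short check below shows (the variable names are illustrative).

```python
# Total node counts from Table 2: (ours, MHT_DAM) per detector.
totals = {
    "DPM": (742_257, 3_609_119),
    "FRCNN": (305_858, 1_763_691),
    "SDP": (520_841, 3_280_139),
}

# Share of MHT_DAM's nodes that our tracker generates, in percent.
relative = {det: round(100.0 * ours / base, 1)
            for det, (ours, base) in totals.items()}
# relative == {"DPM": 20.6, "FRCNN": 17.3, "SDP": 15.9}
```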


TABLE 3
Recall Rate of Different Detectors

Detector   | TP     | FN     | TP+FN   | Recall rate
DPM [36]   | 78,007 | 36,557 | 114,564 | 68.1%
FRCNN [37] | 88,601 | 25,963 | 114,564 | 77.3%
SDP [38]   | 95,699 | 18,865 | 114,564 | 83.5%


As discussed in Sec. 4.3, the number of nodes in the CMT
tracker is mainly influenced by the performance of the detector.
We list the recall rate of the different detectors in Table 3.
It shows that $p$ in Eq. 5 is directly proportional to the
recall rate (Eq. 7). This means that a higher recall rate of the
detector results in a more significant reduction of nodes by
the CMT tracker.

\begin{equation}
p \propto \text{recall rate} = \frac{TP}{TP + FN} \tag{7}
\end{equation}
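The recall rates in Table 3 can be reproduced from the TP and FN counts with the definition in Eq. 7. The FN value for DPM is taken as 114,564 − 78,007 = 36,557, which is the value consistent with the TP+FN column.

```python
def recall(tp, fn):
    """Recall rate = TP / (TP + FN)."""
    return tp / (tp + fn)


# (TP, FN) per detector from Table 3.
table3 = {
    "DPM": (78_007, 36_557),
    "FRCNN": (88_601, 25_963),
    "SDP": (95_699, 18_865),
}
recall_pct = {det: round(100.0 * recall(tp, fn), 1)
              for det, (tp, fn) in table3.items()}
# recall_pct == {"DPM": 68.1, "FRCNN": 77.3, "SDP": 83.5}
```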


In addition to using fewer nodes, we achieve better tracking
results on both MOT 2016 and MOT 2017, as shown in
Table 4 and Table 5. Compared with MHT_DAM, by removing
the constraints on the maximum number of consecutive
dummy nodes, we are better able to track targets
under long-term occlusion where detections are missed. We
reduce IDS by 128 on MOT 2016 and by 956 on MOT 2017,
and improve IDF1 by 8.2 and 8.1 and MOTA by 4.0 and 3.9,
respectively.

The continuity of trajectories is an important constraint
in the tracking task, but traditional MHT methods do
not consider the continuity relationship among detections
in adjacent frames when constructing and expanding
tracking trees. CCM classifies the detections into
different categories and thereby imposes constraints on
detections that describe the continuity of trajectories. As a
result, we not only reduce the number of nodes in the tracking
trees, but also suppress the generation of wrong branches at
the same time.

Figure 7 shows an example of tracking long-term occlu- 
sion targets. Three persons (tagged with red, purple and 
green in Figure 7(e)) are occluded by another person (tagged 


(7) 


p x recall rate = 


(a) Detections 


(e) Ours 


Fig. 7. Qualitative tracking results on MOT17-SDP-01 downloaded from MOT website. Three keyframes (frame 40, 50, 60) are shown in the figures. 
The public detections provided by MOT Challenge 2017 are shown in (a). Tracking results are presented in (b), (c), (d) and (e). 


TABLE 4
Results on MOT 2016 Train

Method        IDF1↑  MOTA↑  MT↑  ML↓  FP↓    FN↓     IDS↓  FM↓  Dummy      Total      Ratio
Ours          57.4   44.3   92   236  4,941  56,348  175   338  415,664    742,257    56.0%
TLMHT [2]     55.0   42.2   79   268  5,398  58,305  157   308  264,356    480,467    55.0%
HAF [27]      54.3   41.7   93   240  7,265  56,916  192   313  2,311,775  4,444,972  52.0%
MHT_DAM [1]   49.2   40.3   88   230  5,401  60,167  303   412  2,057,198  3,609,119  57.0%

TABLE 5
Results on MOT 2017 Train

Method        IDF1↑  MOTA↑  MT↑  ML↓  FP↓     FN↓      IDS↓   FM↓    Dummy      Total       Ratio
Ours          59.1   54.6   486  568  12,266  139,763  759    1,203  1,035,871  2,031,119   51.0%
HAF [27]      56.8   52.2   463  543  11,854  146,242  1,214  1,474  7,644,139  14,415,778  53.0%
TLMHT [2]     56.4   51.2   338  714  11,410  152,443  625    1,023  935,893    1,747,020   53.6%
MHT_DAM [1]   51.0   50.7   422  566  11,743  150,667  1,715  1,520  7,186,727  13,559,862  53.0%

TABLE 6
MOT 2016 Sequences

Name      FPS  Platform  Method        IDF1↑  MOTA↑  MT↑  ML↓  FP↓    FN↓     IDS↓  FM↓
MOT16-02  30   Static    Ours          40.5   30.0   8    26   605    11,856  31    43
                         TLMHT [2]     40.2   26.3   9    35   405    12,724  12    37
                         MHT_DAM [1]   38.1   25.2   8    29   654    12,644  48    55
MOT16-04  30   Static    Ours          63.5   51.5   14   26   2,299  20,715  55    103
                         TLMHT [2]     63.6   53.2   17   24   2,077  20,122  50    96
                         MHT_DAM [1]   52.1   45.6   14   29   2,543  23,214  116   147
MOT16-05  14   Moving    Ours          54.0   40.5   20   60   250    3,791   13    37
                         TLMHT [2]     32.4   25.6   6    77   588    4,458   27    39
                         MHT_DAM [1]   45.7   39.3   20   54   269    3,849   20    41
MOT16-09  30   Static    Ours          64.8   62.7   10   4    438    1,487   34    30
                         TLMHT [2]     64.5   56.1   10   5    374    1,922   14    24
                         MHT_DAM [1]   61.4   58.3   10   4    377    1,773   42    37
MOT16-10  30   Moving    Ours          55.1   44.2   12   26   433    6,425   20    63
                         TLMHT [2]     52.1   37.0   8    31   651    7,096   15    51
                         MHT_DAM [1]   49.9   41.0   9    25   585    6,642   39    61
MOT16-11  30   Moving    Ours          67.7   54.6   17   34   473    3,680   9     16
                         TLMHT [2]     67.2   54.5   19   30   736    3,409   25    27
                         MHT_DAM [1]   62.5   54.1   16   31   477    3,714   21    23
MOT16-13  25   Moving    Ours          38.7   22.7   11   60   443    8,394   13    46
                         TLMHT [2]     33.8   20.0   10   66   567    8,574   14    34
                         MHT_DAM [1]   35.2   22.7   11   58   496    8,331   17    48


with orange in Figure 7(e)) from frame 40 to 60. Their detections
are missed by the detector due to the occlusion, so dummy nodes are
needed to keep the hypotheses alive. However, MHT_DAM and TLMHT fail
to track them because of the constraint on dummy nodes: to control
the scale of the trees, trajectories with more than 15 consecutive
dummy nodes are discarded. In contrast, since CCM dramatically
reduces the number of nodes, we no longer have to worry about the
scale and can remove the constraint on dummy nodes. As a result, our
method successfully tracks all three persons, as shown in
Figure 7(e).
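
The effect of the dummy-node constraint can be illustrated with a minimal Python sketch. It assumes that a branch is stored as a list of detection ids in which `None` stands for a dummy node. This representation, the function names and the 20-frame gap are hypothetical choices for illustration and are not taken from any of the compared trackers.

```python
from typing import Optional, Sequence


def longest_dummy_run(branch: Sequence[Optional[int]]) -> int:
    """Length of the longest run of consecutive dummy nodes (None) in a branch."""
    longest = current = 0
    for node in branch:
        # A dummy node extends the current run; a real detection resets it.
        current = current + 1 if node is None else 0
        longest = max(longest, current)
    return longest


def is_discarded(branch: Sequence[Optional[int]],
                 max_dummy: Optional[int] = 15) -> bool:
    """True if the branch has more than max_dummy consecutive dummy nodes.

    Passing max_dummy=None removes the constraint entirely.
    """
    if max_dummy is None:
        return False
    return longest_dummy_run(branch) > max_dummy


# Hypothetical branch: two detections, 20 missed frames, then one detection.
occluded = [1, 2] + [None] * 20 + [3]

print(is_discarded(occluded))                  # constraint of 15 dummy nodes
print(is_discarded(occluded, max_dummy=None))  # constraint removed
```

With the limit of 15, the branch is discarded before the target reappears, whereas without the limit the hypothesis survives the occlusion.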


5.4 Robustness Analysis 


Although the CMT tracker has more nodes than TLMHT, it achieves
better performance on both IDF1 (by 2.4 and 2.7) and MOTA (by 2.6
and 3.4) on MOT 2016 and MOT 2017, as shown in Table 4 and Table 5.
In this section, we further discuss the reasons for this improvement.

MOT 2016 contains videos with different FPS (frames per second) and
platforms (moving and static cameras). The performance of TLMHT is
severely influenced by the quality of the tracklets. We make a
detailed comparison among CMT, TLMHT and MHT_DAM in Table 6.

For videos with high FPS, CMT and TLMHT have similar performance on
IDF1, while for videos with low FPS, CMT remarkably outperforms
TLMHT. On MOT16-05, a low-FPS video taken by a moving camera, TLMHT
performs even worse than MHT_DAM: it has lower IDF1 and MOTA than
both CMT and MHT_DAM, and its MT is less than half of theirs. In
low-FPS videos, targets are less coherent between adjacent frames,
so it is difficult to generate reliable tracklets of the expected
length. The low-quality tracklets directly pull down the performance
of TLMHT. In contrast, CMT is more robust across different types of
videos.


5.5 Computational Time Analysis 


Low computational efficiency and excessive time consumption have
always been fundamental problems for MHT methods. We conduct the
following comparison experiments to analyze the computational
efficiency of the CMT tracker. TLMHT is a tracking-by-tracklet
tracker, and the length of the tracklets greatly influences its
performance and computational efficiency. The time for generating
tracklets is included, and we set the tracklet length to 3 and 5 for
comparison.


TABLE 7
Computational Time

Dataset   Method              Time (s)
MOT 2016  Ours                658.0
          TLMHT (len=3) [2]   1579.3
          TLMHT (len=5) [2]   2043.4
          MHT_DAM [1]         6836.4
MOT 2017  Ours                2072.1
          TLMHT (len=3) [2]   5984.9
          TLMHT (len=5) [2]   7571.3
          MHT_DAM [1]         18442.1

Compared with MHT_DAM in Table 7, our method takes only 9.6% of the
time to complete the tracking tasks on MOT 2016 and 11.2% of the
time on MOT 2017. We are also faster than TLMHT, even when its
tracklet length is set to 3 for better efficiency.
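
The percentages above follow directly from the running times in Table 7. The short Python sketch below reproduces the arithmetic. The dictionary layout and the helper name `time_fraction` are illustrative assumptions; only the timing values come from the table.

```python
# Total running times in seconds, copied from Table 7.
times = {
    "MOT 2016": {"Ours": 658.0, "TLMHT (len=3)": 1579.3, "MHT_DAM": 6836.4},
    "MOT 2017": {"Ours": 2072.1, "TLMHT (len=3)": 5984.9, "MHT_DAM": 18442.1},
}


def time_fraction(dataset: str, baseline: str) -> float:
    """Running time of our tracker as a percentage of the baseline's time."""
    t = times[dataset]
    return 100.0 * t["Ours"] / t[baseline]


for dataset in times:
    for baseline in ("MHT_DAM", "TLMHT (len=3)"):
        pct = time_fraction(dataset, baseline)
        print(f"{dataset}: {pct:.1f}% of the time of {baseline}")
```

For MHT_DAM this gives 9.6% on MOT 2016 and 11.2% on MOT 2017, matching the figures quoted in the text.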


5.6 Benchmark Comparison 


Table 8 shows the results on MOT Challenge 2016. MOTA and IDF1 are
two aggregate metrics for evaluating the performance of trackers.
Our proposed CMT tracker takes first place when sorted by IDF1 score
(56.6) and third place when sorted by MOTA (48.1). CMT16 outperforms
MHT_DAM by 2.3 on MOTA and 10.5 on IDF1. IDS decreases from 590 to
381, which indicates that our tracker is more effective at keeping
possible hypotheses. Compared with TLMHT, our method achieves a
similar MOTA score and improves IDF1 by 1.3.

Tracking results on the more recent MOT Challenge 2017 are shown in
Table 9. Compared with other MHT-based trackers, CMT obtains a
similar MOTA score while showing state-of-the-art performance,
achieving the best scores on IDF1, FN, IDS, FM and Hz.

The experimental results on both benchmarks show that our method
achieves competitive tracking results while solving the efficiency
problem of MHT.


6 CONCLUSION 


In this paper, we propose the continuous consistency model (CCM) to
categorize detections into four types: continuous, left continuous,
right continuous and discontinuous detections. Unlike previous MHT
tracking methods, we only extend the left continuous and
discontinuous detections with dummy nodes, and choose right
continuous and discontinuous detections as root nodes. In this way,
the exponential growth of the trees is effectively controlled. In
addition, we remove the constraint on dummy nodes to generate more
complete trajectories when long-term occlusion happens. Our proposed
CMT tracker shows a dramatic improvement in computational efficiency
while achieving state-of-the-art results on the MOT Challenge
benchmark.
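
The two branch-management rules summarized above can be written down as a minimal Python sketch. It encodes only which categories receive dummy nodes and which may become root nodes. The criteria for assigning a detection to a category are not shown, and the names `Continuity`, `extends_with_dummy` and `can_be_root` are hypothetical, not identifiers from our implementation.

```python
from enum import Enum


class Continuity(Enum):
    """The four detection categories distinguished by CCM."""
    CONTINUOUS = "continuous"
    LEFT_CONTINUOUS = "left continuous"
    RIGHT_CONTINUOUS = "right continuous"
    DISCONTINUOUS = "discontinuous"


# Categories whose branches are extended with dummy nodes.
DUMMY_EXTENDED = {Continuity.LEFT_CONTINUOUS, Continuity.DISCONTINUOUS}
# Categories that may be chosen as the root node of a new tree.
ROOT_CANDIDATES = {Continuity.RIGHT_CONTINUOUS, Continuity.DISCONTINUOUS}


def extends_with_dummy(category: Continuity) -> bool:
    """True if a detection of this category is extended with a dummy node."""
    return category in DUMMY_EXTENDED


def can_be_root(category: Continuity) -> bool:
    """True if a detection of this category may start a new tree."""
    return category in ROOT_CANDIDATES


for category in Continuity:
    print(f"{category.value:17s} "
          f"dummy={extends_with_dummy(category)} root={can_be_root(category)}")
```

Under these rules, a continuous detection neither receives a dummy node nor starts a new tree, which is consistent with the reduction in nodes reported in the experiments.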